the user any information that is not already contained in $d_1$. Clearly, a
better design is to show only one of the set of identical documents, but
that violates the PRP.

Another simplification made by the PRP is to break up a complex
information need into a number of queries which are each optimized in
isolation. In practice, a document can be highly relevant to the complex
information need as a whole even if it is not the optimal one for an
intermediate step. An example here is an information need that the user
initially expresses using ambiguous words, for example, the query jaguar
to search for information on the animal (as opposed to the car). The
optimal response to this query may be the presentation of documents that
make the user aware of the ambiguity and permit disambiguation of the
query. In contrast, the PRP would mandate the presentation of documents
that are highly relevant to either the car or the animal.

A third important caveat is that the probability of relevance is only
estimated. Given the many simplifying assumptions we make in designing
probabilistic models for IR, we cannot completely trust the probability
estimates. One aspect of this problem is that the variance of the
estimate of probability of relevance may be an important piece of evidence
in some retrieval contexts. For example, a user may prefer a document
that we are certain is probably relevant (low variance of the probability
estimate) to one whose estimated probability of relevance is higher, but
that also has a higher variance of the estimate.



The Vector Space Model 

The vector space model is one of the most widely used models for ad-hoc
retrieval, mainly because of its conceptual simplicity and the appeal of
the underlying metaphor of using spatial proximity for semantic
proximity. Documents and queries are represented in a high-dimensional space,
in which each dimension of the space corresponds to a word in the
document collection. The most relevant documents for a query are expected
to be those represented by the vectors closest to the query, that is,
documents that use similar words to the query. Rather than considering the
magnitude of the vectors, closeness is often calculated by just looking at
angles and choosing documents that enclose the smallest angle with the
query vector.

Figure 15.3 A vector space with two dimensions. The two dimensions
correspond to the terms car and insurance. One query and three documents are
represented in the space.

In figure 15.3, we show a vector space with two dimensions,
corresponding to the words car and insurance. The entities represented in the
space are the query $q$ represented by the vector (0.71, 0.71), and three
documents $d_1$, $d_2$, and $d_3$ with the following coordinates: (0.13, 0.99),
(0.8, 0.6), and (0.99, 0.13). The coordinates or term weights are derived
from occurrence counts as we will see below. For example, insurance may
have only a passing reference in $d_1$ while there are several occurrences
of car, hence the low weight for insurance and the high weight for car.
(In the context of information retrieval, the word term is used for both
words and phrases. We say term weights rather than word weights
because dimensions in the vector space model can correspond to phrases
as well as words.)

In the figure, document $d_2$ has the smallest angle with $q$, so it will be
the top-ranked document in response to the query car insurance. This is
because both 'concepts' (car and insurance) are salient in $d_2$ and
therefore have high weights. The other two documents also mention both
terms, but in each case one of them is not a centrally important term in
the document.



Vector similarity

To do retrieval in the vector space model, documents are ranked according
to their similarity with the query as measured by the cosine measure or
normalized correlation coefficient. We introduced the cosine as a measure
of vector similarity in section 8.5.1 and repeat its definition here:



$$
\cos(q, d) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}}
$$



where $q$ and $d$ are n-dimensional vectors in a real-valued space, the space
of all terms in the case of the vector space model. We compute how well
the occurrence of term $i$ (measured by $q_i$ and $d_i$) correlates in query and
document and then divide by the Euclidean length of the two vectors to
scale for the magnitude of the individual $q_i$ and $d_i$.
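
As a rough illustration of the definition, the following sketch computes the cosine for the query and the three documents of figure 15.3 and sorts the documents by it. The function name `cosine` and the dictionary layout are choices made for this example only.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Coordinates as given for figure 15.3.
q = (0.71, 0.71)
docs = {"d1": (0.13, 0.99), "d2": (0.8, 0.6), "d3": (0.99, 0.13)}

# Rank documents by decreasing cosine with the query.
ranking = sorted(docs, key=lambda name: cosine(q, docs[name]), reverse=True)
for name in ranking:
    print(name, round(cosine(q, docs[name]), 3))
```

With these coordinates the second document comes out on top, and the other two documents receive the same score because their coordinates mirror each other.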

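Recall also from that section that cosine and Euclidean distance give
rise to the same ranking for normalized vectors $x$ and $y$:

$$
\begin{aligned}
(|x - y|)^2 &= \sum_{i=1}^{n} (x_i - y_i)^2 \\
            &= \sum_{i=1}^{n} x_i^2 - 2 \sum_{i=1}^{n} x_i y_i + \sum_{i=1}^{n} y_i^2 \\
            &= 1 - 2 \sum_{i=1}^{n} x_i y_i + 1 \\
            &= 2 \Bigl(1 - \sum_{i=1}^{n} x_i y_i\Bigr)
\end{aligned}
$$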

So for a particular query $q$ and any two documents $d_1$ and $d_2$ we have:

$$
\cos(q, d_1) > \cos(q, d_2) \iff |q - d_1| < |q - d_2|
$$

which implies that the rankings are the same. (We again assume
normalized vectors here.)
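
The equivalence can be checked numerically. The sketch below is only an illustration: it explicitly rescales the figure 15.3 vectors to unit length (the helper names `normalize`, `dot`, and `sq_dist` are invented here), verifies the identity between squared distance and dot product, and compares the two rankings.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(sum(a * a for a in v))
    return tuple(a / length for a in v)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

raw = {"d1": (0.13, 0.99), "d2": (0.8, 0.6), "d3": (0.99, 0.13)}
q = normalize((0.71, 0.71))
docs = {name: normalize(v) for name, v in raw.items()}

# For unit vectors, squared distance equals 2 * (1 - dot product).
for d in docs.values():
    assert math.isclose(sq_dist(q, d), 2 * (1 - dot(q, d)), abs_tol=1e-12)

by_cosine = sorted(docs, key=lambda n: dot(q, docs[n]), reverse=True)
by_distance = sorted(docs, key=lambda n: sq_dist(q, docs[n]))
print(by_cosine, by_distance)
```

Sorting by decreasing dot product and by increasing distance yields the same order, as the derivation predicts.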

If the vectors are normalized, we can compute the cosine as a simple
dot product. Normalization is generally seen as a good thing; otherwise
longer vectors (corresponding to longer documents) would have an unfair
advantage and get ranked higher than shorter ones. (We leave it as an
exercise to show that the vectors in figure 15.3 are normalized, that is,
$\sum_i d_i^2 = 1$.)

Term weighting 

We now turn to the question of how to weight words in the vector space
model. One could just use the count of a word in a document as its term
weight, but there are more effective methods of term weighting. The
basic information used in term weighting is term frequency, document
frequency, and sometimes collection frequency as defined in table 15.3.
Note that $\mathrm{df}_i \le \mathrm{cf}_i$ and that $\sum_j \mathrm{tf}_{i,j} = \mathrm{cf}_i$. It is also important to note
that document frequency and collection frequency can only be used if
there is a collection. This assumption is not always true, for example if
collections are created dynamically by selecting several databases from
a large set (as may be the case on one of the large on-line information
services), and joining them into a temporary collection.

Quantity              Symbol                  Definition
term frequency        $\mathrm{tf}_{i,j}$     number of occurrences of $w_i$ in $d_j$
document frequency    $\mathrm{df}_i$         number of documents in the collection that $w_i$ occurs in
collection frequency  $\mathrm{cf}_i$         total number of occurrences of $w_i$ in the collection

Table 15.3 Three quantities that are commonly used in term weighting in
information retrieval.

Word        Collection Frequency    Document Frequency
insurance   10440                   3997
try         10422                   8760

Table 15.4 Term and document frequencies of two words in an example
corpus.
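
To make the three quantities of table 15.3 concrete, here is a minimal sketch that computes them for a small made-up collection and checks the two relations noted above. The three toy documents are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy collection: each document is a list of tokens.
docs = [
    "car insurance car".split(),
    "try car".split(),
    "try try insurance".split(),
]

# tf[j][w]: number of occurrences of word w in document j.
tf = [Counter(doc) for doc in docs]
# df[w]: number of documents that contain w at least once.
df = Counter(w for counts in tf for w in counts)
# cf[w]: total number of occurrences of w in the whole collection.
cf = Counter(w for doc in docs for w in doc)

for w in cf:
    assert df[w] <= cf[w]                            # df_i <= cf_i
    assert sum(counts[w] for counts in tf) == cf[w]  # sum_j tf_ij = cf_i

print(dict(df), dict(cf))
```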

The information that is captured by term frequency is how salient a
word is within a given document. The higher the term frequency (the
more often the word occurs) the more likely it is that the word is a good
description of the content of the document. Term frequency is usually
dampened by a function like $f(\mathrm{tf}) = \sqrt{\mathrm{tf}}$ or $f(\mathrm{tf}) = 1 + \log(\mathrm{tf})$, $\mathrm{tf} > 0$,
because more occurrences of a word indicate higher importance, but not as
much relative importance as the undampened count would suggest. For
example, $\sqrt{3}$ or $1 + \log 3$ better reflect the importance of a word with three
occurrences than the count 3 itself. The document is somewhat more
important than a document with one occurrence, but not three times as
important.
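
The effect of dampening is easy to see numerically. The sketch below evaluates both functions for a few counts; the base of the logarithm is not stated in the text, so the natural logarithm used here is an assumption, and the function names are invented for this example.

```python
import math

def damp_sqrt(tf):
    """Square-root dampening of a raw term frequency."""
    return math.sqrt(tf)

def damp_log(tf):
    """1 + log(tf) for tf > 0, and 0 otherwise (natural log assumed)."""
    return 1 + math.log(tf) if tf > 0 else 0.0

for tf in (1, 3, 10):
    print(tf, round(damp_sqrt(tf), 2), round(damp_log(tf), 2))
```

For three occurrences both functions give a value between 1 and 3, so the document counts as more important than one with a single occurrence, but not three times as important.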

The second quantity, document frequency, can be interpreted as an
indicator of informativeness. A semantically focussed word will often occur
several times in a document if it occurs at all. Semantically unfocussed
words are spread out homogeneously over all documents. An example
from a corpus of New York Times articles is the words insurance and try
in table 15.4. The two words have about the same collection frequency,
the total number of occurrences in the document collection. But
insurance occurs in only half as many documents as try. This is because the
word try can be used when talking about almost any topic since one can
try to do something in any context. In contrast, insurance refers to a
narrowly defined concept that is only relevant to a small set of topics.
Another property of semantically focussed words is that, if they come
up once in a document, they often occur several times. Insurance occurs
about three times per document, averaged over documents it occurs in at
least once. This is simply due to the fact that most articles about health
insurance, car insurance or similar topics will refer multiple times to the
concept of insurance.
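
The averages mentioned here follow directly from the counts in table 15.4: dividing collection frequency by document frequency gives the mean number of occurrences per document that contains the word. A small sketch (the variable names are chosen for this example):

```python
# (collection frequency, document frequency) as listed in table 15.4.
table_15_4 = {"insurance": (10440, 3997), "try": (10422, 8760)}

# Mean occurrences per document, over documents containing the word.
avg_per_doc = {w: cf / df for w, (cf, df) in table_15_4.items()}

# Ratio of the two document frequencies.
df_ratio = table_15_4["insurance"][1] / table_15_4["try"][1]

for w, avg in avg_per_doc.items():
    print(w, round(avg, 2))
print("df ratio:", round(df_ratio, 2))
```

The ratio for insurance is roughly 2.6, which is the "about three times per document" of the text, while try stays close to one occurrence per document.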

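One way to combine a word's term frequency $\mathrm{tf}_{i,j}$ and document
frequency $\mathrm{df}_i$ into a single weight is as follows:

$$
\mathrm{weight}(i,j) =
\begin{cases}
(1 + \log(\mathrm{tf}_{i,j})) \log \frac{N}{\mathrm{df}_i} & \text{if } \mathrm{tf}_{i,j} \ge 1 \\
0 & \text{if } \mathrm{tf}_{i,j} = 0
\end{cases}
\qquad (15.5)
$$

where $N$ is the total number of documents. The first clause applies for
words occurring in the document, whereas for words that do not appear
($\mathrm{tf}_{i,j} = 0$), we set $\mathrm{weight}(i,j) = 0$.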
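
The weighting formula translates almost literally into code. The following is a minimal sketch, assuming natural logarithms (the text does not fix the base) and using a hypothetical collection size of 1000 documents for the example calls.

```python
import math

def weight(tf, df, n_docs):
    """(1 + log tf) * log(N / df) if tf >= 1, and 0 if tf == 0.

    tf: occurrences of the word in the document
    df: number of documents containing the word
    n_docs: total number of documents N
    """
    if tf < 1:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / df)

N = 1000  # hypothetical collection size
print(weight(0, 5, N))   # word absent from the document
print(weight(1, 1, N))   # word occurring once, in a single document
print(weight(3, N, N))   # word occurring in every document
```

The second call returns log N, the largest value the document frequency factor can take, and the third returns zero because a word found in every document carries no discriminating information under this scheme.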

Document frequency is also scaled logarithmically. The formula
$\log \frac{N}{\mathrm{df}_i} = \log N - \log \mathrm{df}_i$ gives full weight to words that occur in 1
document ($\log N - \log \mathrm{df}_i = \log N - \log 1 = \log N$). A word that occurred in
all documents would get zero weight ($\log N - \log \mathrm{df}_i = \log N - \log N = 0$).
This form of document frequency weighting is often called inverse
document frequency or idf weighting. More generally, the weighting scheme
in (15.5) is an example of a larger family of so-called tf.idf weighting
schemes. Each such scheme can be characterized by its term occurrence
weighting, its document frequency weighting and its normalization. In
one description scheme, we assign a letter code to each component of
the tf.idf scheme. The scheme in (15.5) can then be described as "ltn"
for logarithmic occurrence count weighting (l), logarithmic document
frequency weighting (t), and no normalization (n). Other weighting
possibilities are listed in table 15.5. For example, "ann" is augmented term
occurrence weighting, no document frequency weighting and no
normalization. We refer to vector length normalization as cosine normalization
because the inner product between two length-normalized vectors (the
query-document similarity measure used in the vector space model) is


